Morpheme Segmentation Gold Standards for Finnish and English
Abstract
This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological segmentations for 1.4 million Finnish and 120 000 English words. The Gold Standards comprise surface-string (allomorph) segmentations of word forms as well as deep-level (morpheme) segmentations of the words. The segmentations have been produced semi-automatically and are based on existing resources: the two-level morphological analyzer for Finnish (FINTWOL) and the English CELEX database. In cases where the transition between two morphemes is not clear-cut, so-called “fuzzy morpheme boundaries” have been marked as an option. The Hutmegs package also contains evaluation scripts that let the user compute the accuracy, relative to the Gold Standard, of a segmentation produced by a morphology-learning algorithm. The use of Hutmegs is free for academic purposes, but to access the gold-standard segmentations, inexpensive licenses must be purchased from Lingsoft Inc. (for Finnish) and the Linguistic Data Consortium (for English).
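The abstract does not say how the Hutmegs evaluation scripts measure accuracy, and their file formats are not described here. As an illustration only, the sketch below scores a proposed segmentation against a gold standard using boundary precision, recall and F-measure, a common choice for this task. The function names, the dictionary-based input format and the example words are assumptions made for this sketch, not part of the Hutmegs package, and fuzzy morpheme boundaries are not handled.

```python
def boundaries(morphs):
    """Return the character offsets at which a word is split.

    For ["talo", "ssa"] this is {4}: one boundary after the fourth letter.
    """
    cuts, pos = set(), 0
    for morph in morphs[:-1]:  # no boundary after the final morph
        pos += len(morph)
        cuts.add(pos)
    return cuts


def boundary_scores(predicted, gold):
    """Compute boundary precision, recall and F-measure.

    Both arguments map a word to its list of morphs (a hypothetical
    format chosen for this sketch). Only words found in both mappings
    are scored.
    """
    hits = n_pred = n_gold = 0
    for word, gold_morphs in gold.items():
        if word not in predicted:
            continue
        pred_cuts = boundaries(predicted[word])
        gold_cuts = boundaries(gold_morphs)
        hits += len(pred_cuts & gold_cuts)  # boundaries placed correctly
        n_pred += len(pred_cuts)
        n_gold += len(gold_cuts)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative data only; these entries are not taken from Hutmegs.
gold = {"talossa": ["talo", "ssa"], "walked": ["walk", "ed"]}
predicted = {"talossa": ["talo", "ssa"], "walked": ["wal", "ked"]}
print(boundary_scores(predicted, gold))
```

In this toy example one of the two proposed boundaries matches the gold standard, so precision, recall and F-measure all equal 0.5.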
Related resources
Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-inflecting and Compounding Languages
In this work, we announce the Morfessor 1.0 software package, which is a program that takes as input a corpus of raw text and produces a segmentation of the word forms observed in the text. The segmentation obtained often resembles a linguistic morpheme segmentation. In addition, we briefly describe the Hutmegs package, also publicly available for research purposes. Hutmegs contains semi-automa...
Morfessor in the Morpho Challenge
In this work, Morfessor, a morpheme segmentation model and algorithm developed by the organizers of the Morpho Challenge, is outlined and references are made to earlier work. Although Morfessor does not take part in the official Challenge competition, we report experimental results for the morpheme segmentation of English, Finnish and Turkish words. The obtained results are very good. Morfessor...
Induction of the Morphology of Natural Language: Unsupervised Morpheme Segmentation with Application to Automatic Speech Recognition
In order to develop computer applications that successfully process natural language data (text and speech), one needs good models of the vocabulary and grammar of as many languages as possible. According to standard linguistic theory, words consist of morphemes, which are the smallest individually meaningful elements in a language. Since an immense number of word forms can be constructed by co...
Unsupervised Morpheme Analysis Evaluation by a Comparison to a Linguistic Gold Standard - Morpho Challenge 2008
The goal of Morpho Challenge 2008 was to find and evaluate unsupervised algorithms that provide morpheme analyses for words in different languages. Especially in morphologically complex languages, such as Finnish, Turkish and Arabic, morpheme analysis is important for lexical modeling of words in speech recognition, information retrieval and machine translation. The evaluation in Morpho Challen...
Unsupervised segmentation of words into morphemes – Challenge 2005 An Introduction and Evaluation Report
The objective of the challenge for the unsupervised segmentation of words into morphemes, or shorter the Morpho Challenge, was to design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes. Ideally, these are basic vocabulary units suitable for different tasks, such as speech and text understanding, machine translation, inf...
Publication date: 2004